Uncertainty Coefficient

In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal association. It was first introduced by Henri Theil and is based on the concept of information entropy.


Definition

Suppose we have samples of two discrete random variables, ''X'' and ''Y''. By constructing the joint distribution, P_{X,Y}(x, y), from which we can calculate the conditional distributions, P_{X|Y}(x|y) and P_{Y|X}(y|x), and calculating the various entropies, we can determine the degree of association between the two variables.

The entropy of a single distribution is given as:

: H(X) = -\sum_x P_X(x) \log P_X(x) ,

while the conditional entropy is given as:

: H(X|Y) = -\sum_{x,\,y} P_{X,Y}(x, y) \log P_{X|Y}(x|y) .

The uncertainty coefficient or proficiency is defined as:

: U(X|Y) = \frac{H(X) - H(X|Y)}{H(X)} = \frac{I(X;Y)}{H(X)} ,

and tells us: given ''Y'', what fraction of the bits of ''X'' can we predict? In this case we can think of ''X'' as containing the total information, and of ''Y'' as allowing one to predict part of that information.

The above expression makes clear that the uncertainty coefficient is a normalised mutual information ''I(X;Y)''. In particular, the uncertainty coefficient ranges in [0, 1], since ''I(X;Y)'' ≤ ''H(X)'' and both ''I(X;Y)'' and ''H(X)'' are nonnegative. Note that the value of ''U'' (but not ''H''!) is independent of the base of the logarithm, since all logarithms are proportional.

The uncertainty coefficient is useful for measuring the validity of a statistical classification algorithm, and has the advantage over simpler accuracy measures such as precision and recall that it is not affected by the relative fractions of the different classes, i.e., ''P''(''x''). It also has the unique property that it does not penalize an algorithm for predicting the wrong classes, so long as it does so consistently (i.e., it simply rearranges the classes). This is useful in evaluating clustering algorithms, since cluster labels typically have no particular ordering.
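As an illustrative sketch (the function names here are our own, not a standard library API), the coefficient can be estimated from paired label samples using the identity ''H(X|Y)'' = ''H(X,Y)'' − ''H(Y)'', so that ''U(X|Y)'' = (''H(X)'' + ''H(Y)'' − ''H(X,Y)'') / ''H(X)'':

```python
import numpy as np
from collections import Counter

def _entropy(counts):
    """Shannon entropy (in bits) of a distribution given by raw counts."""
    p = np.asarray(list(counts), dtype=float)
    p = p[p > 0] / p.sum()          # normalise, dropping zero-count cells
    return -np.sum(p * np.log2(p))

def uncertainty_coefficient(x, y):
    """U(X|Y) = I(X;Y) / H(X): fraction of the bits of x predictable from y.

    x, y: equal-length sequences of discrete (hashable) labels.
    """
    hx = _entropy(Counter(x).values())          # H(X)
    hy = _entropy(Counter(y).values())          # H(Y)
    hxy = _entropy(Counter(zip(x, y)).values()) # joint entropy H(X,Y)
    # I(X;Y) = H(X) + H(Y) - H(X,Y); normalise by H(X).
    return (hx + hy - hxy) / hx

# A consistent relabelling (0<->1) is fully predictive: U = 1.
print(uncertainty_coefficient([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
# Independent labels carry no information about each other: U = 0.
print(uncertainty_coefficient([0, 1, 0, 1], [0, 0, 1, 1]))  # → 0.0
```

The base of the logarithm (here 2) cancels in the ratio, consistent with the base-independence of ''U'' noted above.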


Variations

The uncertainty coefficient is not symmetric with respect to the roles of ''X'' and ''Y''. The roles can be reversed and a symmetrical measure thus defined as a weighted average between the two:

: U(X, Y) = \frac{H(X)\,U(X|Y) + H(Y)\,U(Y|X)}{H(X) + H(Y)} = 2\left[\frac{H(X) + H(Y) - H(X, Y)}{H(X) + H(Y)}\right] .

Although normally applied to discrete variables, the uncertainty coefficient can be extended to continuous variables using density estimation.
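The symmetric variant reduces to 2''I(X;Y)'' / (''H(X)'' + ''H(Y)''), which can be sketched the same way (again, the function name is ours for illustration):

```python
import numpy as np
from collections import Counter

def _entropy(counts):
    """Shannon entropy (in bits) of a distribution given by raw counts."""
    p = np.asarray(list(counts), dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(x, y):
    """Entropy-weighted average of U(X|Y) and U(Y|X).

    Equals 2 * I(X;Y) / (H(X) + H(Y)) = 2 * (H(X)+H(Y)-H(X,Y)) / (H(X)+H(Y)).
    """
    hx = _entropy(Counter(x).values())
    hy = _entropy(Counter(y).values())
    hxy = _entropy(Counter(zip(x, y)).values())  # joint entropy H(X,Y)
    return 2.0 * (hx + hy - hxy) / (hx + hy)

# Fully dependent labels give 1, independent labels give 0.
print(symmetric_uncertainty([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
print(symmetric_uncertainty([0, 1, 0, 1], [0, 0, 1, 1]))  # → 0.0
```

Weighting by the entropies ensures the variable with more intrinsic information contributes proportionally to the combined score.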


See also

* Mutual information
* Rand index
* F-score
* Binary classification


External links


libagf
Includes software for calculating uncertainty coefficients.